Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Ruidong Zhang 206294444

Display machine information for reproducibility:

sessionInfo()

Load necessary libraries (you can add more as needed).

rm(list = ls())
library(arrow)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
library(readr)
library(dplyr)
library(tidyr)
install.packages("lubridate")

Display your machine memory.

memuse::Sys.meminfo()

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer

Patient of Interest:

sid <- 10013310

Import the necessary data: transfer

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(readr)
sid_adt <- read_csv("~/mimic/hosp/transfers.csv.gz") %>%
  filter(subject_id == sid) %>%
  collect() %>%
  print(width = Inf)
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 14 × 7
   subject_id  hadm_id transfer_id eventtype
        <dbl>    <dbl>       <dbl> <chr>    
 1   10013310 21243435    31696219 discharge
 2   10013310 21243435    31736720 ED       
 3   10013310 21243435    33511674 transfer 
 4   10013310 21243435    34848129 transfer 
 5   10013310 21243435    38910974 admit    
 6   10013310 22098926    31651850 transfer 
 7   10013310 22098926    32769810 admit    
 8   10013310 22098926    33278851 transfer 
 9   10013310 22098926    34063502 ED       
10   10013310 22098926    36029206 discharge
11   10013310 27682188    30077870 transfer 
12   10013310 27682188    30444898 discharge
13   10013310 27682188    31203589 admit    
14   10013310 27682188    35160955 ED       
   careunit                                        intime             
   <chr>                                           <dttm>             
 1 <NA>                                            2153-06-05 19:58:00
 2 Emergency Department                            2153-05-26 08:56:00
 3 Medicine/Cardiology                             2153-05-26 16:19:26
 4 Medicine/Cardiology                             2153-05-26 14:42:55
 5 Medicine/Cardiology                             2153-05-26 14:18:39
 6 Neuro Intermediate                              2153-06-12 16:31:33
 7 Neuro Surgical Intensive Care Unit (Neuro SICU) 2153-06-10 11:55:42
 8 Medicine                                        2153-06-16 19:03:14
 9 Emergency Department                            2153-06-10 10:40:00
10 <NA>                                            2153-07-21 18:02:28
11 Medicine/Cardiology                             2153-05-07 20:47:19
12 <NA>                                            2153-05-13 15:36:52
13 Coronary Care Unit (CCU)                        2153-05-06 18:28:00
14 Emergency Department                            2153-05-06 10:21:00
   outtime            
   <dttm>             
 1 NA                 
 2 2153-05-26 14:18:39
 3 2153-06-05 19:58:00
 4 2153-05-26 16:19:26
 5 2153-05-26 14:42:55
 6 2153-06-16 19:03:14
 7 2153-06-12 16:31:33
 8 2153-07-21 18:02:28
 9 2153-06-10 11:55:42
10 NA                 
11 2153-05-13 15:36:52
12 NA                 
13 2153-05-07 20:47:19
14 2153-05-06 18:28:00
sid_lab <- arrow::open_dataset("~/labevents_pq") %>%
  filter(subject_id == sid) %>%
  collect() %>% 
  print(width = Inf)
# A tibble: 2,285 × 16
   labevent_id subject_id hadm_id specimen_id itemid order_provider_id
         <int>      <int>   <int>       <int>  <int> <chr>            
 1      153564   10013310      NA     4841989  50887 ""               
 2      153565   10013310      NA     8958046  50934 ""               
 3      153566   10013310      NA     8958046  50947 ""               
 4      153567   10013310      NA     8958046  51003 ""               
 5      153568   10013310      NA     8958046  51678 ""               
 6      153569   10013310      NA    10682517  50933 ""               
 7      153570   10013310      NA    11713499  51133 ""               
 8      153571   10013310      NA    11713499  51146 ""               
 9      153572   10013310      NA    11713499  51200 ""               
10      153573   10013310      NA    11713499  51221 ""               
   charttime           storetime          
   <dttm>              <dttm>             
 1 2153-05-06 04:30:00 NA                 
 2 2153-05-06 04:30:00 2153-05-06 05:22:00
 3 2153-05-06 04:30:00 2153-05-06 05:22:00
 4 2153-05-06 04:30:00 2153-05-06 05:41:00
 5 2153-05-06 04:30:00 2153-05-06 05:22:00
 6 2153-05-06 04:30:00 NA                 
 7 2153-05-06 04:30:00 2153-05-06 05:09:00
 8 2153-05-06 04:30:00 2153-05-06 05:09:00
 9 2153-05-06 04:30:00 2153-05-06 05:09:00
10 2153-05-06 04:30:00 2153-05-06 05:09:00
   value                                    valuenum valueuom ref_range_lower
   <chr>                                       <dbl> <chr>              <dbl>
 1 HOLD.  DISCARD GREATER THAN 24 HRS OLD.     NA    ""                  NA  
 2 5                                            5    ""                  NA  
 3 2                                            2    ""                  NA  
 4 ___                                          2.97 "ng/mL"              0  
 5 14                                          14    ""                  NA  
 6 HOLD.  DISCARD GREATER THAN 4 HOURS OLD.    NA    ""                  NA  
 7 1.90                                         1.9  "K/uL"               1.2
 8 0.2                                          0.2  "%"                  0  
 9 0.1                                          0.1  "%"                  1  
10 32.5                                        32.5  "%"                 34  
   ref_range_upper flag       priority comments
             <dbl> <chr>      <chr>    <chr>   
 1           NA    ""         STAT     "___"   
 2           NA    ""         STAT     ""      
 3           NA    ""         STAT     ""      
 4            0.01 "abnormal" STAT     "___"   
 5           NA    ""         STAT     ""      
 6           NA    ""         STAT     "___"   
 7            3.7  ""         STAT     ""      
 8            1    ""         STAT     ""      
 9            7    "abnormal" STAT     ""      
10           45    "abnormal" STAT     ""      
# ℹ 2,275 more rows
sid_pcd <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") %>%
  filter(subject_id == sid) %>%
  mutate(chartdate = as.POSIXct(chartdate)) %>%
  left_join(read_csv("~/mimic/hosp/d_icd_procedures.csv.gz") %>%
              select(icd_code, long_title), by = "icd_code") %>%
  collect() %>%
  print(width = Inf)
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 9 × 7
  subject_id  hadm_id seq_num chartdate           icd_code icd_version
       <dbl>    <dbl>   <dbl> <dttm>              <chr>          <dbl>
1   10013310 21243435       1 2153-05-27 00:00:00 4A023N7           10
2   10013310 21243435       2 2153-05-27 00:00:00 B2111ZZ           10
3   10013310 21243435       3 2153-05-27 00:00:00 B241ZZ3           10
4   10013310 22098926       1 2153-06-10 00:00:00 03CG3ZZ           10
5   10013310 22098926       2 2153-06-10 00:00:00 3E05317           10
6   10013310 22098926       3 2153-07-15 00:00:00 0DH63UZ           10
7   10013310 22098926       4 2153-06-11 00:00:00 3E0G76Z           10
8   10013310 27682188       1 2153-05-06 00:00:00 027034Z           10
9   10013310 27682188       2 2153-05-06 00:00:00 B211YZZ           10
  long_title                                                                    
  <chr>                                                                         
1 Measurement of Cardiac Sampling and Pressure, Left Heart, Percutaneous Approa…
2 Fluoroscopy of Multiple Coronary Arteries using Low Osmolar Contrast          
3 Ultrasonography of Multiple Coronary Arteries, Intravascular                  
4 Extirpation of Matter from Intracranial Artery, Percutaneous Approach         
5 Introduction of Other Thrombolytic into Peripheral Artery, Percutaneous Appro…
6 Insertion of Feeding Device into Stomach, Percutaneous Approach               
7 Introduction of Nutritional Substance into Upper GI, Via Natural or Artificia…
8 Dilation of Coronary Artery, One Artery with Drug-eluting Intraluminal Device…
9 Fluoroscopy of Multiple Coronary Arteries using Other Contrast                
sid_dcd <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") %>%
  filter(subject_id == sid) %>%
  print(width = Inf)
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 71 × 5
   subject_id  hadm_id seq_num icd_code icd_version
        <dbl>    <dbl>   <dbl> <chr>          <dbl>
 1   10013310 21243435       1 I222              10
 2   10013310 21243435       2 I5023             10
 3   10013310 21243435       3 I428              10
 4   10013310 21243435       4 E1142             10
 5   10013310 21243435       5 E1165             10
 6   10013310 21243435       6 I213              10
 7   10013310 21243435       7 I110              10
 8   10013310 21243435       8 I2510             10
 9   10013310 21243435       9 M25511            10
10   10013310 21243435      10 E785              10
# ℹ 61 more rows
sid_patients <- read_csv("~/mimic/hosp/patients.csv.gz") %>%
  filter(subject_id == sid) %>%
  print(width = Inf)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 6
  subject_id gender anchor_age anchor_year anchor_year_group dod       
       <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
1   10013310 F              70        2153 2017 - 2019       2153-11-19
sid_adm <- read_csv("~/mimic/hosp/admissions.csv.gz") %>%
  filter(subject_id == sid) %>%
  print(width = Inf)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 16
  subject_id  hadm_id admittime           dischtime           deathtime
       <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
1   10013310 21243435 2153-05-26 14:18:00 2153-06-05 19:30:00 NA       
2   10013310 22098926 2153-06-10 11:55:00 2153-07-21 18:00:00 NA       
3   10013310 27682188 2153-05-06 18:03:00 2153-05-13 13:45:00 NA       
  admission_type    admit_provider_id admission_location       
  <chr>             <chr>             <chr>                    
1 OBSERVATION ADMIT P78TNY            INFORMATION NOT AVAILABLE
2 OBSERVATION ADMIT P09IS0            INFORMATION NOT AVAILABLE
3 URGENT            P89ZCW            TRANSFER FROM HOSPITAL   
  discharge_location       insurance language marital_status race         
  <chr>                    <chr>     <chr>    <chr>          <chr>        
1 HOME HEALTH CARE         Medicare  ?        SINGLE         BLACK/AFRICAN
2 SKILLED NURSING FACILITY Medicare  ?        SINGLE         BLACK/AFRICAN
3 HOME HEALTH CARE         Medicare  ?        SINGLE         BLACK/AFRICAN
  edregtime           edouttime           hospital_expire_flag
  <dttm>              <dttm>                             <dbl>
1 2153-05-26 08:56:00 2153-05-26 16:33:00                    0
2 2153-06-10 10:40:00 2153-06-10 11:25:00                    0
3 2153-05-06 10:21:00 2153-05-06 18:28:00                    0
sid_dip <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz") %>%
  collect() %>%
  print(width = Inf)
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 85,257 × 3
   icd_code icd_version
   <chr>          <dbl>
 1 0001               9
 2 0002               9
 3 0003               9
 4 0009               9
 5 001               10
 6 0010               9
 7 0011               9
 8 0012               9
 9 0013               9
10 0014               9
   long_title                                                 
   <chr>                                                      
 1 Therapeutic ultrasound of vessels of head and neck         
 2 Therapeutic ultrasound of heart                            
 3 Therapeutic ultrasound of peripheral vascular vessels      
 4 Other therapeutic ultrasound                               
 5 Central Nervous System and Cranial Nerves, Bypass          
 6 Implantation of chemotherapeutic agent                     
 7 Infusion of drotrecogin alfa (activated)                   
 8 Administration of inhaled nitric oxide                     
 9 Injection or infusion of nesiritide                        
10 Injection or infusion of oxazolidinone class of antibiotics
# ℹ 85,247 more rows
sid_did <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz") %>%
  left_join(sid_dcd, by = "icd_code") %>%
  group_by(subject_id, icd_code) %>%
  filter(row_number() == 1, seq_num <= 3) %>%
  ungroup() %>%
  collect() %>%
  print(width = Inf)
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 8 × 7
  icd_code icd_version.x
  <chr>            <dbl>
1 I2111               10
2 I222                10
3 I428                10
4 I5023               10
5 I618                10
6 I63412              10
7 J690                10
8 N170                10
  long_title                                                                
  <chr>                                                                     
1 ST elevation (STEMI) myocardial infarction involving right coronary artery
2 Subsequent non-ST elevation (NSTEMI) myocardial infarction                
3 Other cardiomyopathies                                                    
4 Acute on chronic systolic (congestive) heart failure                      
5 Other nontraumatic intracerebral hemorrhage                               
6 Cerebral infarction due to embolism of left middle cerebral artery        
7 Pneumonitis due to inhalation of food and vomit                           
8 Acute kidney failure with tubular necrosis                                
  subject_id  hadm_id seq_num icd_version.y
       <dbl>    <dbl>   <dbl>         <dbl>
1   10013310 27682188       1            10
2   10013310 21243435       1            10
3   10013310 21243435       3            10
4   10013310 21243435       2            10
5   10013310 22098926       2            10
6   10013310 22098926       1            10
7   10013310 22098926       3            10
8   10013310 27682188       2            10
# Visualization
library(ggplot2)
library(stringr)

# Define your shapes here, ensure you have at least as many as you have unique values
my_shapes <- c(16, 17, 18, 19, 20, 21, 22, 23, 24, 25)

sid_adt_nomissing <- sid_adt[complete.cases(sid_adt[c('intime', 'outtime', 'careunit')]), ]


ggplot() +
  geom_segment(data = sid_adt_nomissing, aes(x = intime, xend = outtime, y = "ADT", 
                                   yend = "ADT", color = careunit, 
                                   linewidth = str_detect(careunit, "ICU|CCU"))) +
  
  geom_point(data = sid_lab, aes(x = charttime), y = "Lab", shape = 3) +
    
  geom_point(data = sid_pcd,
             aes(x = chartdate, y = "Procedure", shape = long_title)) +
  
  labs(x = "Calendar Time", y = "Event Type", 
       title = paste("Patient ", sid,
                     ", ", sid_patients$gender, ", ",
                     sid_patients$anchor_age, " years old, ", 
                     tolower(sid_adm$race)), 
       subtitle = paste(paste(tolower(sid_did$long_title[1:3]), collapse = "\n")),
       shape = "Procedure") +
  
  scale_color_discrete(name = "Care Unit") +
  
  scale_shape_manual(values = my_shapes) + # Add this line
  
  scale_y_discrete(limits = c("Procedure", "Lab", "ADT")) +
  
  theme(legend.position = "bottom",
        legend.box = "vertical")
Warning: Using linewidth for a discrete variable is not advised.

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Answer

Import the necessary data: chartevents

sid_chart <- arrow::open_dataset("~/chartevents_pq") %>%
  filter(subject_id == sid) %>%
  collect() %>%
  print(width = Inf)
# A tibble: 7,338 × 11
   subject_id  hadm_id  stay_id caregiver_id charttime          
        <int>    <int>    <int>        <int> <dttm>             
 1   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 2   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 3   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 4   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 5   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 6   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 7   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 8   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 9   10013310 22098926 32769810        10285 2153-06-11 02:00:00
10   10013310 22098926 32769810        10285 2153-06-11 02:00:00
   storetime           itemid value valuenum valueuom   warning
   <dttm>               <int> <chr>    <dbl> <chr>        <int>
 1 2153-06-11 03:25:00 223761 98.8      98.8 "°F"             0
 2 2153-06-11 03:25:00 224642 Oral      NA   ""               0
 3 2153-06-11 03:26:00 220046 130      130   "bpm"            0
 4 2153-06-11 03:26:00 220047 50        50   "bpm"            0
 5 2153-06-11 03:26:00 223751 160      160   "mmHg"           0
 6 2153-06-11 03:26:00 223752 90        90   "mmHg"           0
 7 2153-06-11 03:26:00 223769 100      100   "%"              0
 8 2153-06-11 03:26:00 223770 92        92   "%"              0
 9 2153-06-11 03:26:00 224161 35        35   "insp/min"       0
10 2153-06-11 03:26:00 224162 8          8   "insp/min"       0
# ℹ 7,328 more rows
sid_ditems <- read_csv("~/mimic/icu/d_items.csv.gz") %>%
  filter(abbreviation %in% c("HR", "NBPd", "NBPs", "RR", "Temperature F")) %>%
  collect() %>%
  print(width = Inf)
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 5 × 9
  itemid label                                 abbreviation  linksto    
   <dbl> <chr>                                 <chr>         <chr>      
1 220045 Heart Rate                            HR            chartevents
2 220179 Non Invasive Blood Pressure systolic  NBPs          chartevents
3 220180 Non Invasive Blood Pressure diastolic NBPd          chartevents
4 220210 Respiratory Rate                      RR            chartevents
5 223761 Temperature Fahrenheit                Temperature F chartevents
  category            unitname param_type lownormalvalue highnormalvalue
  <chr>               <chr>    <chr>               <dbl>           <dbl>
1 Routine Vital Signs bpm      Numeric                NA              NA
2 Routine Vital Signs mmHg     Numeric                NA              NA
3 Routine Vital Signs mmHg     Numeric                NA              NA
4 Respiratory         insp/min Numeric                NA              NA
5 Routine Vital Signs °F       Numeric                NA              NA
dit_ICU_stays <- sid_chart %>%
  left_join(sid_ditems, by = "itemid") %>%
  filter(!is.na(abbreviation)) %>%
  mutate(value <- as.numeric(value)) %>%
  print(width = Inf)
# A tibble: 549 × 20
   subject_id  hadm_id  stay_id caregiver_id charttime          
        <int>    <int>    <int>        <int> <dttm>             
 1   10013310 22098926 32769810        10285 2153-06-11 02:00:00
 2   10013310 22098926 32769810        10285 2153-06-11 03:00:00
 3   10013310 22098926 32769810        10285 2153-06-11 03:00:00
 4   10013310 22098926 32769810        10285 2153-06-11 03:02:00
 5   10013310 22098926 32769810        10285 2153-06-11 03:02:00
 6   10013310 22098926 32769810        10285 2153-06-12 01:00:00
 7   10013310 22098926 32769810        10285 2153-06-12 01:00:00
 8   10013310 22098926 32769810        10285 2153-06-12 01:03:00
 9   10013310 22098926 32769810        10285 2153-06-12 01:03:00
10   10013310 22098926 32769810        10285 2153-06-12 02:00:00
   storetime           itemid value valuenum valueuom warning
   <dttm>               <dbl> <chr>    <dbl> <chr>      <int>
 1 2153-06-11 03:25:00 223761 98.8      98.8 °F             0
 2 2153-06-11 03:25:00 220045 113      113   bpm            0
 3 2153-06-11 03:25:00 220210 26        26   insp/min       0
 4 2153-06-11 03:25:00 220179 131      131   mmHg           0
 5 2153-06-11 03:25:00 220180 62        62   mmHg           0
 6 2153-06-12 01:37:00 220045 121      121   bpm            0
 7 2153-06-12 01:37:00 220210 25        25   insp/min       0
 8 2153-06-12 01:37:00 220179 134      134   mmHg           0
 9 2153-06-12 01:37:00 220180 70        70   mmHg           0
10 2153-06-12 01:37:00 223761 99        99   °F             0
   label                                 abbreviation  linksto    
   <chr>                                 <chr>         <chr>      
 1 Temperature Fahrenheit                Temperature F chartevents
 2 Heart Rate                            HR            chartevents
 3 Respiratory Rate                      RR            chartevents
 4 Non Invasive Blood Pressure systolic  NBPs          chartevents
 5 Non Invasive Blood Pressure diastolic NBPd          chartevents
 6 Heart Rate                            HR            chartevents
 7 Respiratory Rate                      RR            chartevents
 8 Non Invasive Blood Pressure systolic  NBPs          chartevents
 9 Non Invasive Blood Pressure diastolic NBPd          chartevents
10 Temperature Fahrenheit                Temperature F chartevents
   category            unitname param_type lownormalvalue highnormalvalue
   <chr>               <chr>    <chr>               <dbl>           <dbl>
 1 Routine Vital Signs °F       Numeric                NA              NA
 2 Routine Vital Signs bpm      Numeric                NA              NA
 3 Respiratory         insp/min Numeric                NA              NA
 4 Routine Vital Signs mmHg     Numeric                NA              NA
 5 Routine Vital Signs mmHg     Numeric                NA              NA
 6 Routine Vital Signs bpm      Numeric                NA              NA
 7 Respiratory         insp/min Numeric                NA              NA
 8 Routine Vital Signs mmHg     Numeric                NA              NA
 9 Routine Vital Signs mmHg     Numeric                NA              NA
10 Routine Vital Signs °F       Numeric                NA              NA
   `value <- as.numeric(value)`
                          <dbl>
 1                         98.8
 2                        113  
 3                         26  
 4                        131  
 5                         62  
 6                        121  
 7                         25  
 8                        134  
 9                         70  
10                         99  
# ℹ 539 more rows
library(ggplot2)
library(scales)

Attaching package: 'scales'
The following object is masked from 'package:readr':

    col_factor
library(dplyr)

# Assuming dit_ICU_stays is already loaded and contains the correct data

# Make sure the 'value' column is numeric
dit_ICU_stays <- dit_ICU_stays %>%
  mutate(value = as.numeric(value)) 

# Plot the graph
ggplot(data = dit_ICU_stays, aes(x = charttime, y = value, color = abbreviation)) + 
  geom_line() +
  geom_point() +
  facet_grid(abbreviation ~ stay_id, scales = "free") + # both axes have free scales
  labs(title = paste("Patient", sid, "ICU stays - Vitals"),
       x = "",
       y = "") +
  theme(legend.title = element_blank(),
        axis.text.x = element_text(hjust = 1, size = 6), 
        axis.text.y = element_text(size = 8)) + 
  scale_x_datetime(labels = date_format("%b %d %H:%M")) +
  scale_y_continuous(labels = comma)

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz") %>%
  collect() %>%
  print(width = Inf)
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2157-12-20 14:27:41 0.948
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2131-03-10 18:09:21 0.859
 9 2129-08-10 17:02:38 6.18 
10 2130-09-27 22:13:41 3.89 
# ℹ 73,171 more rows

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer

# the numer of unique subject_id
icustays_tble %>%
  count(subject_id) %>%
  nrow()
[1] 50920

There are 50920 unique subject_id with sid 10013310.

icustays_sum <- icustays_tble %>%
  group_by(subject_id) %>%
  summarize(n_icu_stays = n_distinct(stay_id)) %>%
  collect() %>%
  print(width = Inf)
# A tibble: 50,920 × 2
   subject_id n_icu_stays
        <dbl>       <int>
 1   10000032           1
 2   10000980           1
 3   10001217           2
 4   10001725           1
 5   10001884           1
 6   10002013           1
 7   10002155           3
 8   10002348           1
 9   10002428           4
10   10002430           1
# ℹ 50,910 more rows

Based on the summary, a subject_id can have multiple ICU stays.

# the number of ICU stays per `subject_id` by graphs
icustays_plot <- ggplot(icustays_sum, aes(x = n_icu_stays)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(x = "Number of ICU Stays", y = "Count", 
       title = "Distribution of ICU Stays per Subject") +
  theme_minimal()

print(icustays_plot)

Summarize the number of ICU stays per subject_id by graphs.

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

Answer

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz") %>%
  collect() %>%
  print(admissions_tble, width = Inf)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime           deathtime
        <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00 NA       
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00 NA       
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00 NA       
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00 NA       
   admission_type    admit_provider_id admission_location     discharge_location
   <chr>             <chr>             <chr>                  <chr>             
 1 URGENT            P874LG            TRANSFER FROM HOSPITAL HOME              
 2 EW EMER.          P09Q6Y            EMERGENCY ROOM         HOME              
 3 EW EMER.          P60CC5            EMERGENCY ROOM         HOSPICE           
 4 EW EMER.          P30KEH            EMERGENCY ROOM         HOME              
 5 EU OBSERVATION    P51VDL            EMERGENCY ROOM         <NA>              
 6 EW EMER.          P6957U            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
 7 EU OBSERVATION    P63AD6            PHYSICIAN REFERRAL     <NA>              
 8 EU OBSERVATION    P38XXV            EMERGENCY ROOM         <NA>              
 9 EU OBSERVATION    P2358X            EMERGENCY ROOM         <NA>              
10 OBSERVATION ADMIT P75S70            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
   insurance language marital_status race  edregtime          
   <chr>     <chr>    <chr>          <chr> <dttm>             
 1 Other     ENGLISH  WIDOWED        WHITE 2180-05-06 19:17:00
 2 Medicaid  ENGLISH  WIDOWED        WHITE 2180-06-26 15:54:00
 3 Medicaid  ENGLISH  WIDOWED        WHITE 2180-08-05 20:58:00
 4 Medicaid  ENGLISH  WIDOWED        WHITE 2180-07-23 05:54:00
 5 Other     ENGLISH  SINGLE         WHITE 2160-03-03 21:55:00
 6 Medicare  ENGLISH  MARRIED        WHITE 2160-11-20 20:36:00
 7 Medicare  ENGLISH  MARRIED        WHITE 2160-12-27 18:32:00
 8 Other     ENGLISH  SINGLE         WHITE 2163-09-27 16:18:00
 9 Other     ENGLISH  DIVORCED       WHITE 2181-11-14 21:51:00
10 Other     ENGLISH  DIVORCED       WHITE 2183-09-18 08:41:00
   edouttime           hospital_expire_flag
   <dttm>                             <dbl>
 1 2180-05-06 23:30:00                    0
 2 2180-06-26 21:31:00                    0
 3 2180-08-06 01:44:00                    0
 4 2180-07-23 14:00:00                    0
 5 2160-03-04 06:26:00                    0
 6 2160-11-21 03:20:00                    0
 7 2160-12-28 16:07:00                    0
 8 2163-09-28 09:04:00                    0
 9 2181-11-15 09:57:00                    0
10 2183-09-18 20:20:00                    0
# ℹ 431,221 more rows

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Answer

library(lubridate) # load package if you haven't already

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
# summary of admissions
 admissions_summary <- admissions_tble %>%
  mutate(admission_hour = hour(admittime),
         admission_minute = minute(admittime),
         dischtime = as.POSIXct(dischtime),
         admission_length = dischtime - admittime) %>%
  mutate(admission_length_seconds = as.numeric(admission_length)) %>%
  collect()

# Print admissions summary
print(admissions_summary, width = Inf)
# A tibble: 431,231 × 20
   subject_id  hadm_id admittime           dischtime           deathtime
        <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00 NA       
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00 NA       
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00 NA       
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00 NA       
   admission_type    admit_provider_id admission_location     discharge_location
   <chr>             <chr>             <chr>                  <chr>             
 1 URGENT            P874LG            TRANSFER FROM HOSPITAL HOME              
 2 EW EMER.          P09Q6Y            EMERGENCY ROOM         HOME              
 3 EW EMER.          P60CC5            EMERGENCY ROOM         HOSPICE           
 4 EW EMER.          P30KEH            EMERGENCY ROOM         HOME              
 5 EU OBSERVATION    P51VDL            EMERGENCY ROOM         <NA>              
 6 EW EMER.          P6957U            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
 7 EU OBSERVATION    P63AD6            PHYSICIAN REFERRAL     <NA>              
 8 EU OBSERVATION    P38XXV            EMERGENCY ROOM         <NA>              
 9 EU OBSERVATION    P2358X            EMERGENCY ROOM         <NA>              
10 OBSERVATION ADMIT P75S70            WALK-IN/SELF REFERRAL  HOME HEALTH CARE  
   insurance language marital_status race  edregtime          
   <chr>     <chr>    <chr>          <chr> <dttm>             
 1 Other     ENGLISH  WIDOWED        WHITE 2180-05-06 19:17:00
 2 Medicaid  ENGLISH  WIDOWED        WHITE 2180-06-26 15:54:00
 3 Medicaid  ENGLISH  WIDOWED        WHITE 2180-08-05 20:58:00
 4 Medicaid  ENGLISH  WIDOWED        WHITE 2180-07-23 05:54:00
 5 Other     ENGLISH  SINGLE         WHITE 2160-03-03 21:55:00
 6 Medicare  ENGLISH  MARRIED        WHITE 2160-11-20 20:36:00
 7 Medicare  ENGLISH  MARRIED        WHITE 2160-12-27 18:32:00
 8 Other     ENGLISH  SINGLE         WHITE 2163-09-27 16:18:00
 9 Other     ENGLISH  DIVORCED       WHITE 2181-11-14 21:51:00
10 Other     ENGLISH  DIVORCED       WHITE 2183-09-18 08:41:00
   edouttime           hospital_expire_flag admission_hour admission_minute
   <dttm>                             <dbl>          <int>            <int>
 1 2180-05-06 23:30:00                    0             22               23
 2 2180-06-26 21:31:00                    0             18               27
 3 2180-08-06 01:44:00                    0             23               44
 4 2180-07-23 14:00:00                    0             12               35
 5 2160-03-04 06:26:00                    0             23               16
 6 2160-11-21 03:20:00                    0              1               56
 7 2160-12-28 16:07:00                    0              5               11
 8 2163-09-28 09:04:00                    0             23               17
 9 2181-11-15 09:57:00                    0              2                5
10 2183-09-18 20:20:00                    0             18               10
   admission_length admission_length_seconds
   <drtn>                              <dbl>
 1  67920 secs                         67920
 2  87720 secs                         87720
 3 151560 secs                        151560
 4 192000 secs                        192000
 5  25800 secs                         25800
 6 392160 secs                        392160
 7  39360 secs                         39360
 8  35220 secs                         35220
 9  46020 secs                         46020
10 253200 secs                        253200
# ℹ 431,221 more rows
# number of admissions per patient
admissions_summary %>%
  count(subject_id) %>%
  ggplot(aes(x = n)) +
  geom_bar() +
  labs(x = "Number of Admissions", y = "Number of Patients", 
       title = "Number of Admissions per Patient")

# admission hour
admissions_summary %>%
  ggplot(aes(x = admission_hour)) +
  geom_histogram(bins = 24) + # Assuming admission_hour is a 24-hour format
  labs(x = "Admission Hour", y = "Count", 
       title = "Distribution of Admission Hours")

# admission minute
admissions_summary %>%
  ggplot(aes(x = admission_minute)) +
  geom_histogram(bins = 60) + # Assuming admission_minute ranges from 0 to 59
  labs(x = "Admission Minute", y = "Count", 
       title = "Distribution of Admission Minutes")

# length of hospital stay
#second to day

admissions_summary %>%
  ggplot(aes(x = admission_length_seconds)) +
  geom_histogram(binwidth = (max(admissions_summary$admission_length_seconds) - min(admissions_summary$admission_length_seconds)) / 30) +
  labs(x = "Length of Stay (Seconds)", y = "Count", 
       title = "Distribution of Length of Hospital Stay")

explanation of patterns:

  • The number of admissions per patient is right-skewed, with most patients having only one admission.
  • The distribution of admission hours is relatively uniform, with a slight increase in the morning hours.
  • The distribution of admission minutes is relatively uniform, with a slight increase at the 0th minute.
  • The distribution of length of hospital stay is right-skewed, with most stays being less than 100,000 seconds (about 27.8 hours).

In conclusion, the patterns observed in the data are consistent with what we would expect in a hospital setting. The distribution of admission hours and minutes is relatively uniform, which is expected given that patients can be admitted at any time of the day. The distribution of length of hospital stay is right-skewed, with most stays being relatively short, which is also expected given that most patients are admitted for short-term care.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

Answer

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz") %>%
  collect() %>%
  print(patients_tble, width = Inf)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 299,712 × 6
   subject_id gender anchor_age anchor_year anchor_year_group dod       
        <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
 1   10000032 F              52        2180 2014 - 2016       2180-09-09
 2   10000048 F              23        2126 2008 - 2010       NA        
 3   10000068 F              19        2160 2008 - 2010       NA        
 4   10000084 M              72        2160 2017 - 2019       2161-02-13
 5   10000102 F              27        2136 2008 - 2010       NA        
 6   10000108 M              25        2163 2014 - 2016       NA        
 7   10000115 M              24        2154 2017 - 2019       NA        
 8   10000117 F              48        2174 2008 - 2010       NA        
 9   10000178 F              59        2157 2017 - 2019       NA        
10   10000248 M              34        2192 2014 - 2016       NA        
# ℹ 299,702 more rows

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

Answer

# Summary statistics for gender
gender_summary <- patients_tble %>%
  count(gender) 

# Summary statistics for anchor_age
anchor_age_summary <- patients_tble %>%
  summarize(mean_anchor_age = mean(anchor_age),
            median_anchor_age = median(anchor_age),
            min_anchor_age = min(anchor_age),
            max_anchor_age = max(anchor_age))

gender_plot <- ggplot(gender_summary, aes(x = gender, y = n)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Gender", y = "Count", 
       title = "Gender Distribution of Patients") +
  theme_minimal()

print(gender_plot)

anchor_age_plot <- ggplot(patients_tble, aes(x = anchor_age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(x = "Anchor Age", y = "Count", 
       title = "Distribution of Anchor Age") +
  theme_minimal()

print(anchor_age_plot)

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

d_labitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz", col_types = cols(
  label = col_character(),
  fluid = col_character(),
  category = col_character(),
  itemid = col_double()  # specify itemid as double to avoid type mismatch during join
))
print(d_labitems, width = Inf)
# A tibble: 1,622 × 4
   itemid label                               fluid category 
    <dbl> <chr>                               <chr> <chr>    
 1  50801 Alveolar-arterial Gradient          Blood Blood Gas
 2  50802 Base Excess                         Blood Blood Gas
 3  50803 Calculated Bicarbonate, Whole Blood Blood Blood Gas
 4  50804 Calculated Total CO2                Blood Blood Gas
 5  50805 Carboxyhemoglobin                   Blood Blood Gas
 6  50806 Chloride, Whole Blood               Blood Blood Gas
 7  50808 Free Calcium                        Blood Blood Gas
 8  50809 Glucose                             Blood Blood Gas
 9  50810 Hematocrit, Calculated              Blood Blood Gas
10  50811 Hemoglobin                          Blood Blood Gas
# ℹ 1,612 more rows
icustays_tble_Q5 <- icustays_tble %>%
  mutate(subject_id = as.character(subject_id))
library(lubridate)
library(tidyr)
# Read in the lab items data, specifying column types
d_labitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz", col_types = cols(
  label = col_character(),
  fluid = col_character(),
  category = col_character(),
  itemid = col_double() 
))
print(d_labitems, width = Inf)
# A tibble: 1,622 × 4
   itemid label                               fluid category 
    <dbl> <chr>                               <chr> <chr>    
 1  50801 Alveolar-arterial Gradient          Blood Blood Gas
 2  50802 Base Excess                         Blood Blood Gas
 3  50803 Calculated Bicarbonate, Whole Blood Blood Blood Gas
 4  50804 Calculated Total CO2                Blood Blood Gas
 5  50805 Carboxyhemoglobin                   Blood Blood Gas
 6  50806 Chloride, Whole Blood               Blood Blood Gas
 7  50808 Free Calcium                        Blood Blood Gas
 8  50809 Glucose                             Blood Blood Gas
 9  50810 Hematocrit, Calculated              Blood Blood Gas
10  50811 Hemoglobin                          Blood Blood Gas
# ℹ 1,612 more rows
# Convert subject_id to character to ensure it matches the other table's type
icustays_tble_Q5 <- icustays_tble %>%
  mutate(subject_id = as.character(subject_id))

# Open the lab events dataset and filter for relevant data
labevents_tble <- arrow::open_dataset("~/labevents.parquet", 
                                      format = "parquet") %>%
  select(subject_id, itemid, valuenum, storetime) %>%
  mutate(subject_id = as.character(subject_id), # Ensure subject_id is character
         itemid = as.numeric(itemid)) %>% # Convert itemid to numeric
  filter(subject_id %in% icustays_tble_Q5$subject_id,
         itemid %in% c(50912, 50971, 50983, 50902, 
                       50882, 51221, 51301, 50931)) %>%
  collect() %>%
  left_join(icustays_tble_Q5, by = "subject_id") %>%
  left_join(d_labitems, by = "itemid") %>%
  select(subject_id, stay_id, itemid, valuenum, storetime, intime, label) %>%
  filter(storetime < intime) %>%
  group_by(subject_id, stay_id, itemid) %>%
  slice_max(order_by = storetime, n = 1) %>%
  ungroup()
Warning in left_join(., icustays_tble_Q5, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 5 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# Reshape the data to a wide format
labevents_final_tble <- labevents_tble %>%
  select(subject_id, stay_id, label, valuenum) %>%
  pivot_wider(names_from = label, values_from = valuenum,
              values_fn = list(valuenum = last)) 

labevents_final_tble <- labevents_final_tble %>%
  rename_with(tolower)

# Rename the 'white blood cells' column to 'wbc'
labevents_final_tble <- labevents_final_tble %>%
  rename(wbc = 'white blood cells')

# Display the final table
labevents_final_tble
# A tibble: 68,467 × 10
   subject_id  stay_id bicarbonate chloride creatinine glucose potassium sodium
   <chr>         <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1 10000032   39553978          25       95        0.7     102       6.7    126
 2 10000980   39765666          21      109        2.3      89       3.9    144
 3 10001217   34592300          30      104        0.5      87       4.1    142
 4 10001217   37067082          22      108        0.6     112       4.2    142
 5 10001725   31205490          NA       98       NA        NA       4.1    139
 6 10001884   37510196          30       88        1.1     141       4.5    130
 7 10002013   39060235          24      102        0.9     288       3.5    137
 8 10002155   31090461          23       98        2.8     117       4.9    135
 9 10002155   32358465          26       85        1.4     133       5.7    120
10 10002155   33685454          24      105        1.1     138       4.6    139
# ℹ 68,457 more rows
# ℹ 2 more variables: hematocrit <dbl>, wbc <dbl>

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Answer

# Import chartevents
d_items <- read_csv("~/mimic/icu/d_items.csv.gz")
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Specify vital signs itemids
vital_itemids <- c(220045, 220179, 220180, 223761, 220210)
library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:lubridate':

    duration
The following object is masked from 'package:utils':

    timestamp
library(dplyr)
library(tidyr)
# Read the dataset and filter by vital_itemids
chartevents_tble <- arrow::open_dataset("~/chartevents_pq", format = "parquet") %>%
  filter(itemid %in% vital_itemids) %>%
  collect()

# Avoid data overlap
icustays_tble_Q6 <- icustays_tble %>%
  mutate(subject_id = as.character(subject_id))

chartevents_tble <- chartevents_tble %>%
  mutate(subject_id = as.character(subject_id))

# Join and filter data
chartevents_icustays <- chartevents_tble %>%
  inner_join(icustays_tble_Q6, by = c("subject_id", "hadm_id", "stay_id")) %>%
  filter(charttime >= intime & charttime <= outtime)

# Get the first measurement for each itemid
chartevents_firstm <- chartevents_icustays %>%
  group_by(subject_id, stay_id, itemid) %>%
  summarise(value = first(value, order_by = charttime), .groups = "drop")

# Pivot the table wider and then rename the columns directly
chartevents_final_tble <- chartevents_firstm %>%
  pivot_wider(
    id_cols = c(subject_id, stay_id),
    names_from = itemid,
    values_from = value
  ) %>%
  rename(
    heart_rate = `220045`,
    non_invasive_blood_pressure_systolic = `220179`,
    non_invasive_blood_pressure_diastolic = `220180`,
    temperature_fahrenheit = `223761`,
    respiratory_rate = `220210`
  )

# Display the final table
print(chartevents_final_tble)
# A tibble: 73,164 × 7
   subject_id  stay_id heart_rate non_invasive_blood_pr…¹ non_invasive_blood_p…²
   <chr>         <dbl> <chr>      <chr>                   <chr>                 
 1 10000032   39553978 91         84                      48                    
 2 10000980   39765666 77         150                     77                    
 3 10001217   34592300 96         167                     95                    
 4 10001217   37067082 86         151                     90                    
 5 10001725   31205490 55         73                      56                    
 6 10001884   37510196 38         180                     12                    
 7 10002013   39060235 80         104                     70                    
 8 10002155   31090461 94         118                     51                    
 9 10002155   32358465 98         109                     65                    
10 10002155   33685454 68         126                     61                    
# ℹ 73,154 more rows
# ℹ abbreviated names: ¹​non_invasive_blood_pressure_systolic,
#   ²​non_invasive_blood_pressure_diastolic
# ℹ 2 more variables: respiratory_rate <chr>, temperature_fahrenheit <chr>

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq and chartevents_pq folders available at the current working directory, for example, by a symbolic link.

Answer

# Merge icustays_tble, admissions_tble, patients_tble, lab_summary, and chart_summary
mimic_icu_cohort <- icustays_tble %>%
  inner_join(patients_tble, by = "subject_id") %>%
  mutate(age_intime = year(intime) - anchor_year + anchor_age) %>% # calculate age at intime
  filter(year(intime) - anchor_year + anchor_age >= 18) %>%
  inner_join(admissions_tble, by = "hadm_id") %>%
  select(-contains(".y")) %>%
  rename(subject_id = subject_id.x) %>%
  print(width = Inf)
# A tibble: 73,181 × 28
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los gender anchor_age anchor_year anchor_year_group
   <dttm>              <dbl> <chr>       <dbl>       <dbl> <chr>            
 1 2180-07-23 23:50:47 0.410 F              52        2180 2014 - 2016      
 2 2189-06-27 20:38:27 0.498 F              73        2186 2008 - 2010      
 3 2157-11-21 22:08:00 1.12  F              55        2157 2011 - 2013      
 4 2157-12-20 14:27:41 0.948 F              55        2157 2011 - 2013      
 5 2110-04-12 23:59:56 1.34  F              46        2110 2011 - 2013      
 6 2131-01-20 08:27:30 9.17  F              68        2122 2008 - 2010      
 7 2160-05-19 17:33:33 1.31  F              53        2156 2008 - 2010      
 8 2131-03-10 18:09:21 0.859 F              80        2128 2008 - 2010      
 9 2129-08-10 17:02:38 6.18  F              80        2128 2008 - 2010      
10 2130-09-27 22:13:41 3.89  F              80        2128 2008 - 2010      
   dod        age_intime admittime           dischtime          
   <date>          <dbl> <dttm>              <dttm>             
 1 2180-09-09         52 2180-07-23 12:35:00 2180-07-25 17:55:00
 2 2193-08-26         76 2189-06-27 07:38:00 2189-07-03 03:00:00
 3 NA                 55 2157-11-18 22:56:00 2157-11-25 18:00:00
 4 NA                 55 2157-12-18 16:58:00 2157-12-24 14:55:00
 5 NA                 46 2110-04-11 15:08:00 2110-04-14 15:00:00
 6 2131-01-20         77 2131-01-07 20:39:00 2131-01-20 05:15:00
 7 NA                 57 2160-05-18 07:45:00 2160-05-23 13:30:00
 8 2131-03-10         83 2131-03-09 20:33:00 2131-03-10 01:55:00
 9 2131-03-10         81 2129-08-04 12:44:00 2129-08-18 16:53:00
10 2131-03-10         82 2130-09-23 21:59:00 2130-09-29 18:55:00
   deathtime           admission_type              admit_provider_id
   <dttm>              <chr>                       <chr>            
 1 NA                  EW EMER.                    P30KEH           
 2 NA                  EW EMER.                    P30KEH           
 3 NA                  EW EMER.                    P4645A           
 4 NA                  DIRECT EMER.                P99698           
 5 NA                  EW EMER.                    P35SU0           
 6 2131-01-20 05:15:00 OBSERVATION ADMIT           P874LG           
 7 NA                  SURGICAL SAME DAY ADMISSION P47E1G           
 8 2131-03-10 21:53:00 EW EMER.                    P80515           
 9 NA                  EW EMER.                    P05HUO           
10 NA                  EW EMER.                    P3529J           
   admission_location discharge_location           insurance language
   <chr>              <chr>                        <chr>     <chr>   
 1 EMERGENCY ROOM     HOME                         Medicaid  ENGLISH 
 2 EMERGENCY ROOM     HOME HEALTH CARE             Medicare  ENGLISH 
 3 EMERGENCY ROOM     HOME HEALTH CARE             Other     ?       
 4 PHYSICIAN REFERRAL HOME HEALTH CARE             Other     ?       
 5 PACU               HOME                         Other     ENGLISH 
 6 EMERGENCY ROOM     DIED                         Medicare  ENGLISH 
 7 PHYSICIAN REFERRAL HOME HEALTH CARE             Medicare  ENGLISH 
 8 EMERGENCY ROOM     DIED                         Other     ENGLISH 
 9 PROCEDURE SITE     CHRONIC/LONG TERM ACUTE CARE Other     ENGLISH 
10 EMERGENCY ROOM     HOME HEALTH CARE             Other     ENGLISH 
   marital_status race                   edregtime           edouttime          
   <chr>          <chr>                  <dttm>              <dttm>             
 1 WIDOWED        WHITE                  2180-07-23 05:54:00 2180-07-23 14:00:00
 2 MARRIED        BLACK/AFRICAN AMERICAN 2189-06-27 06:25:00 2189-06-27 08:42:00
 3 MARRIED        WHITE                  2157-11-18 17:38:00 2157-11-19 01:24:00
 4 MARRIED        WHITE                  NA                  NA                 
 5 MARRIED        WHITE                  NA                  NA                 
 6 MARRIED        BLACK/AFRICAN AMERICAN 2131-01-07 13:36:00 2131-01-07 22:13:00
 7 SINGLE         OTHER                  NA                  NA                 
 8 MARRIED        WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
 9 MARRIED        WHITE                  2129-08-04 11:00:00 2129-08-04 12:35:00
10 MARRIED        WHITE                  2130-09-23 19:59:00 2130-09-24 00:50:00
   hospital_expire_flag
                  <dbl>
 1                    0
 2                    0
 3                    0
 4                    0
 5                    0
 6                    1
 7                    0
 8                    1
 9                    0
10                    0
# ℹ 73,171 more rows
# Filter labevents_tble for the last lab measurement before ICU stay
mimic_icu_cohort <- mimic_icu_cohort %>%
  left_join(chartevents_final_tble, by = "stay_id")
mimic_icu_cohort
# A tibble: 73,181 × 34
   subject_id.x hadm_id stay_id first_careunit last_careunit intime             
          <dbl>   <dbl>   <dbl> <chr>          <chr>         <dttm>             
 1     10000032  2.91e7  3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2     10000980  2.69e7  3.98e7 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3     10001217  2.46e7  3.71e7 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4     10001217  2.77e7  3.46e7 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5     10001725  2.56e7  3.12e7 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6     10001884  2.62e7  3.75e7 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7     10002013  2.36e7  3.91e7 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8     10002155  2.03e7  3.24e7 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9     10002155  2.38e7  3.37e7 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10     10002155  2.90e7  3.11e7 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 28 more variables: outtime <dttm>, los <dbl>, gender <chr>,
#   anchor_age <dbl>, anchor_year <dbl>, anchor_year_group <chr>, dod <date>,
#   age_intime <dbl>, admittime <dttm>, dischtime <dttm>, deathtime <dttm>,
#   admission_type <chr>, admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>, …
mimic_icu_cohort <- mimic_icu_cohort %>%
  select(-contains(".y")) %>%
  rename(subject_id = subject_id.x)
mimic_icu_cohort
# A tibble: 73,181 × 33
   subject_id  hadm_id  stay_id first_careunit last_careunit intime             
        <dbl>    <dbl>    <dbl> <chr>          <chr>         <dttm>             
 1   10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3   10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4   10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5   10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6   10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7   10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8   10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9   10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10   10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 27 more variables: outtime <dttm>, los <dbl>, gender <chr>,
#   anchor_age <dbl>, anchor_year <dbl>, anchor_year_group <chr>, dod <date>,
#   age_intime <dbl>, admittime <dttm>, dischtime <dttm>, deathtime <dttm>,
#   admission_type <chr>, admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>, …
# Merge the labevents data
# Ensure that stay_id is of the same type in both tables
mimic_icu_cohort <- mutate(mimic_icu_cohort, stay_id = as.character(stay_id), subject_id = as.character(subject_id))
labevents_final_tble <- mutate(labevents_final_tble, stay_id = as.character(stay_id), subject_id = as.character(subject_id))

# Remove potential duplicates in labevents_final_tble
labevents_final_tble <- labevents_final_tble %>% distinct(subject_id, stay_id, .keep_all = TRUE)

# Perform the left join
mimic_icu_cohort <- mimic_icu_cohort %>%
  left_join(labevents_final_tble, by = c("subject_id", "stay_id"))

# Remove unwanted .y columns and rename .x columns
colnames(mimic_icu_cohort) <- gsub("\\.y$", "", colnames(mimic_icu_cohort))
colnames(mimic_icu_cohort) <- gsub("\\.x$", "", colnames(mimic_icu_cohort))

# Optionally, select only the columns you want to keep
mimic_icu_cohort <- select(mimic_icu_cohort, -contains(".y"))

# Print the combined table with adjusted width
print(mimic_icu_cohort, width = Inf)
# A tibble: 73,181 × 41
   subject_id  hadm_id stay_id  first_careunit                                  
   <chr>         <dbl> <chr>    <chr>                                           
 1 10000032   29079034 39553978 Medical Intensive Care Unit (MICU)              
 2 10000980   26913865 39765666 Medical Intensive Care Unit (MICU)              
 3 10001217   24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4 10001217   27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5 10001725   25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6 10001884   26184834 37510196 Medical Intensive Care Unit (MICU)              
 7 10002013   23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8 10002155   20345487 32358465 Medical Intensive Care Unit (MICU)              
 9 10002155   23822395 33685454 Coronary Care Unit (CCU)                        
10 10002155   28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los gender anchor_age anchor_year anchor_year_group
   <dttm>              <dbl> <chr>       <dbl>       <dbl> <chr>            
 1 2180-07-23 23:50:47 0.410 F              52        2180 2014 - 2016      
 2 2189-06-27 20:38:27 0.498 F              73        2186 2008 - 2010      
 3 2157-11-21 22:08:00 1.12  F              55        2157 2011 - 2013      
 4 2157-12-20 14:27:41 0.948 F              55        2157 2011 - 2013      
 5 2110-04-12 23:59:56 1.34  F              46        2110 2011 - 2013      
 6 2131-01-20 08:27:30 9.17  F              68        2122 2008 - 2010      
 7 2160-05-19 17:33:33 1.31  F              53        2156 2008 - 2010      
 8 2131-03-10 18:09:21 0.859 F              80        2128 2008 - 2010      
 9 2129-08-10 17:02:38 6.18  F              80        2128 2008 - 2010      
10 2130-09-27 22:13:41 3.89  F              80        2128 2008 - 2010      
   dod        age_intime admittime           dischtime          
   <date>          <dbl> <dttm>              <dttm>             
 1 2180-09-09         52 2180-07-23 12:35:00 2180-07-25 17:55:00
 2 2193-08-26         76 2189-06-27 07:38:00 2189-07-03 03:00:00
 3 NA                 55 2157-11-18 22:56:00 2157-11-25 18:00:00
 4 NA                 55 2157-12-18 16:58:00 2157-12-24 14:55:00
 5 NA                 46 2110-04-11 15:08:00 2110-04-14 15:00:00
 6 2131-01-20         77 2131-01-07 20:39:00 2131-01-20 05:15:00
 7 NA                 57 2160-05-18 07:45:00 2160-05-23 13:30:00
 8 2131-03-10         83 2131-03-09 20:33:00 2131-03-10 01:55:00
 9 2131-03-10         81 2129-08-04 12:44:00 2129-08-18 16:53:00
10 2131-03-10         82 2130-09-23 21:59:00 2130-09-29 18:55:00
   deathtime           admission_type              admit_provider_id
   <dttm>              <chr>                       <chr>            
 1 NA                  EW EMER.                    P30KEH           
 2 NA                  EW EMER.                    P30KEH           
 3 NA                  EW EMER.                    P4645A           
 4 NA                  DIRECT EMER.                P99698           
 5 NA                  EW EMER.                    P35SU0           
 6 2131-01-20 05:15:00 OBSERVATION ADMIT           P874LG           
 7 NA                  SURGICAL SAME DAY ADMISSION P47E1G           
 8 2131-03-10 21:53:00 EW EMER.                    P80515           
 9 NA                  EW EMER.                    P05HUO           
10 NA                  EW EMER.                    P3529J           
   admission_location discharge_location           insurance language
   <chr>              <chr>                        <chr>     <chr>   
 1 EMERGENCY ROOM     HOME                         Medicaid  ENGLISH 
 2 EMERGENCY ROOM     HOME HEALTH CARE             Medicare  ENGLISH 
 3 EMERGENCY ROOM     HOME HEALTH CARE             Other     ?       
 4 PHYSICIAN REFERRAL HOME HEALTH CARE             Other     ?       
 5 PACU               HOME                         Other     ENGLISH 
 6 EMERGENCY ROOM     DIED                         Medicare  ENGLISH 
 7 PHYSICIAN REFERRAL HOME HEALTH CARE             Medicare  ENGLISH 
 8 EMERGENCY ROOM     DIED                         Other     ENGLISH 
 9 PROCEDURE SITE     CHRONIC/LONG TERM ACUTE CARE Other     ENGLISH 
10 EMERGENCY ROOM     HOME HEALTH CARE             Other     ENGLISH 
   marital_status race                   edregtime           edouttime          
   <chr>          <chr>                  <dttm>              <dttm>             
 1 WIDOWED        WHITE                  2180-07-23 05:54:00 2180-07-23 14:00:00
 2 MARRIED        BLACK/AFRICAN AMERICAN 2189-06-27 06:25:00 2189-06-27 08:42:00
 3 MARRIED        WHITE                  2157-11-18 17:38:00 2157-11-19 01:24:00
 4 MARRIED        WHITE                  NA                  NA                 
 5 MARRIED        WHITE                  NA                  NA                 
 6 MARRIED        BLACK/AFRICAN AMERICAN 2131-01-07 13:36:00 2131-01-07 22:13:00
 7 SINGLE         OTHER                  NA                  NA                 
 8 MARRIED        WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
 9 MARRIED        WHITE                  2129-08-04 11:00:00 2129-08-04 12:35:00
10 MARRIED        WHITE                  2130-09-23 19:59:00 2130-09-24 00:50:00
   hospital_expire_flag heart_rate non_invasive_blood_pressure_systolic
                  <dbl> <chr>      <chr>                               
 1                    0 91         84                                  
 2                    0 77         150                                 
 3                    0 86         151                                 
 4                    0 96         167                                 
 5                    0 55         73                                  
 6                    1 38         180                                 
 7                    0 80         104                                 
 8                    1 98         109                                 
 9                    0 68         126                                 
10                    0 94         118                                 
   non_invasive_blood_pressure_diastolic respiratory_rate temperature_fahrenheit
   <chr>                                 <chr>            <chr>                 
 1 48                                    24               98.7                  
 2 77                                    23               98                    
 3 90                                    18               98.5                  
 4 95                                    11               97.6                  
 5 56                                    19               97.7                  
 6 12                                    10               98.1                  
 7 70                                    14               97.2                  
 8 65                                    23               97.7                  
 9 61                                    18               95.9                  
10 51                                    18               96.9                  
   bicarbonate chloride creatinine glucose potassium sodium hematocrit   wbc
         <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>      <dbl> <dbl>
 1          25       95        0.7     102       6.7    126       41.1   6.9
 2          21      109        2.3      89       3.9    144       27.3   5.3
 3          22      108        0.6     112       4.2    142       38.1  15.7
 4          30      104        0.5      87       4.1    142       37.4   5.4
 5          NA       98       NA        NA       4.1    139       NA    NA  
 6          30       88        1.1     141       4.5    130       39.7  12.2
 7          24      102        0.9     288       3.5    137       34.9   7.2
 8          26       85        1.4     133       5.7    120       22.4   9.8
 9          24      105        1.1     138       4.6    139       39.7   7.9
10          23       98        2.8     117       4.9    135       25.5  17.9
# ℹ 73,171 more rows

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the average vital measurements within the first hour of ICU stay

  • Length of ICU stay los vs first ICU unit

Answer

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)
library(ggplot2)
library(dplyr)

demographic_func <- function(data, demographic) {
  data %>%
    group_by(!!sym(demographic)) %>%
    summarize(mean_los = mean(los, na.rm = TRUE), 
              median_los = median(los, na.rm = TRUE),
              n = n()) %>%
    ungroup()
}
# Function to create plots for categorical demographics
plot_categorical <- function(data, demographic) {
  ggplot(data, aes_string(x = demographic, y = "mean_los", fill = demographic)) +
    geom_col() +
    theme_minimal() +
    labs(title = paste("Average Length of ICU Stay by", demographic),
         x = demographic,
         y = "Average Length of Stay (days)") +
    theme(axis.text.x = element_text(, hjust = 1, vjust = 1, size = 8, 
                                     face = "plain", lineheight = 0.8, margin = 
                                       margin(t = 0, r = 0, b = 10, l = 0))) +
    scale_fill_brewer(palette = "Set3")
}

# Function to create a scatter plot for continuous demographics like age
plot_continuous <- function(data, demographic, fill_color = "lightblue") {
  ggplot(data, aes_string(x = demographic, y = "mean_los")) +
    geom_point(alpha = 0.5, color = fill_color) +
    geom_smooth(method = "loess", color = "blue", se = FALSE) +
    theme_minimal() +
    labs(title = paste("Average Length of ICU Stay by", demographic),
         x = demographic,
         y = "Average Length of Stay (days)")
}

# Generate summaries
summary_by_demographics <- lapply(c("insurance", "marital_status", 
                                    "gender", "anchor_age"), function(demo) {
  demographic_func(mimic_icu_cohort, demo)
})

# Assign names to the list based on demographics for easier access
names(summary_by_demographics) <- c("insurance", "marital_status", 
                                    "gender", "anchor_age")

# Create plots
plots <- list(
  plot_categorical(summary_by_demographics[["insurance"]], "insurance"),
  plot_categorical(summary_by_demographics[["marital_status"]], 
                   "marital_status"),
  plot_categorical(summary_by_demographics[["gender"]], "gender"),
  plot_continuous(summary_by_demographics[["anchor_age"]], "anchor_age")
)
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
# Print the plots with fitted width
print(plots[[1]])

print(plots[[2]])

print(plots[[3]])

print(plots[[4]])
`geom_smooth()` using formula = 'y ~ x'

Use ggplot with only four main race categories and an “Other” category for less frequent:

plot_categorical <- function(data, demographic) {
  # Collapse less frequent categories into 'Other'
  data <- data %>%
    mutate(!!demographic := ifelse(!(!!sym(demographic) %in% c("WHITE", "BLACK/AFRICAN AMERICAN", "HISPANIC OR LATINO", "ASIAN")), "OTHER", !!sym(demographic)))

  # Plot
  ggplot(data, aes_string(x = demographic, y = "mean_los", 
                          fill = demographic)) +
    geom_bar(stat = "identity", position = "dodge") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7, 
                                     face = "plain"), 
          plot.title = element_text(hjust = 0.5),
          axis.title = element_text(size = 10)) +
    labs(title = paste("Length of ICU Stay (los) vs", demographic),
         x = demographic,
         y = "Length of Stay (los)") +
    scale_fill_brewer(palette = "Paired") 
}

# Generate summary for race
race_summary <- demographic_func(mimic_icu_cohort, "race")

# Create plot for race
race_plot <- plot_categorical(race_summary, "race")

# Print the race plot
print(race_plot)

Use summary statistics to describe the average length of ICU stay by demographic variables. For example, you can calculate the mean, standard deviation, minimum, and maximum length of stay for each demographic category. Here’s an example of how you might do this using the dplyr package in R:

race_stats <- race_summary %>%
  group_by(race) %>%
  summarise(
    count = n(),
    mean = mean(mean_los, na.rm = TRUE),
    min = min(mean_los, na.rm = TRUE),
    max = max(mean_los, na.rm = TRUE)
  )

# Print the race statistics
print(race_stats)
# A tibble: 33 × 5
   race                          count  mean   min   max
   <chr>                         <int> <dbl> <dbl> <dbl>
 1 AMERICAN INDIAN/ALASKA NATIVE     1  4.46  4.46  4.46
 2 ASIAN                             1  3.49  3.49  3.49
 3 ASIAN - ASIAN INDIAN              1  4.20  4.20  4.20
 4 ASIAN - CHINESE                   1  3.36  3.36  3.36
 5 ASIAN - KOREAN                    1  4.23  4.23  4.23
 6 ASIAN - SOUTH EAST ASIAN          1  3.04  3.04  3.04
 7 BLACK/AFRICAN                     1  3.82  3.82  3.82
 8 BLACK/AFRICAN AMERICAN            1  3.28  3.28  3.28
 9 BLACK/CAPE VERDEAN                1  3.22  3.22  3.22
10 BLACK/CARIBBEAN ISLAND            1  3.61  3.61  3.61
# ℹ 23 more rows

Description: The plots show the average length of ICU stay by demographic variables. The first four plots are bar plots for categorical, The plot 1 is Average Length of ICU Stay vs Insurance by bar chart, it shows that the average length of ICU stay is similar across different insurance types. The plot 2 is Average Length of ICU Stay vs Marital Status by bar chart, it shows that the average length of ICU stay is similar across different marital status. The plot 3 is Average Length of ICU, it shows that the average length of ICU stay is similar. The plot 4 is Average Length of ICU Stay vs Age by scatter plot, it shows that the average length of ICU stay is similar across different age groups. THe plot 5 is Average Length of ICU Stay vs Race by bar chart, it shows that the average length of ICU stay is similar across on 5 race categories. And there is a summary statistics for race categories, it shows the average length of ICU stay, sd, min, and max for each category.

  • Length of ICU stay los vs the last available lab measurements before ICU stay
# Join the ICU stay data with the last lab measurements
patients_tble <- patients_tble %>%
  mutate(subject_id = as.character(subject_id))
icu_last_lab <- mimic_icu_cohort %>%
  mutate(subject_id = as.character(subject_id)) %>%
  left_join(patients_tble, by = "subject_id")

# Reshape the data to long format
data_long <- icu_last_lab %>%
  pivot_longer(cols = c('bicarbonate', 'chloride', 'creatinine', 'glucose', 
                        'potassium', 'sodium', 'hematocrit'),
               names_to = "Lab_measurements",
               values_to = "Value")

# Create a plot with different panels for each lab measurement
ggplot(data_long, aes(x = Value, y = los)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~Lab_measurements, scales = "free_x") +
  labs(title = "Los vs Last Lab Measurements",
       x = "Last Lab Measurement",
       y = "Length of Stay in ICU (Los)") +
  theme_minimal()
Warning: Removed 55592 rows containing missing values (`geom_point()`).

Description: The plot shows the length of ICU stay vs the last available lab measurements before ICU stay. The x-axis represents the last lab measurement value, and the y-axis represents the length of stay in ICU. Each panel represents a different lab measurement. The plot shows that there is no clear relationship between the last lab measurements and the length of ICU stay.

  • Length of ICU stay los vs the average vital measurements within the first hour of ICU stay
# Merging
data <- merge(icustays_tble, chartevents_final_tble, 
                       by = c("subject_id", "stay_id"))
# Reshaping
data_reshaped <- data %>%
  pivot_longer(cols = c("heart_rate", 
    "non_invasive_blood_pressure_systolic", 
    "non_invasive_blood_pressure_diastolic", 
    "temperature_fahrenheit", 
    "respiratory_rate"),
               names_to = "Vital",
               values_to = "Measurement")
ggplot(data_reshaped, aes(x = Measurement, y = los)) +
  geom_point(alpha = 0.6) + 
  facet_wrap(~Vital, scales = "free_x") + 
  labs(x = "Vital Measurements",
       y = "Length of ICU Stay (los)",
       title = "ICU Stay Length vs First Vital Measurement") +
  theme_bw() + 
  theme(strip.text.x = element_text(size = 10))

Description: The plot shows the length of ICU stay vs the average vital measurements within the first hour of ICU stay. The x-axis represents the average vital sign measurement, and the y-axis represents the length of stay in ICU. Each panel represents a different vital sign. The plot shows that there is no clear relationship between the average vital measurements within the first hour of ICU stay and the length of ICU stay.

  • Length of ICU stay los vs first ICU unit
library(ggplot2)
library(stringr)

# variable transformation
mimic_icu_cohort$first_careunit <- factor(mimic_icu_cohort$first_careunit, 
                                           levels = unique(
                                             mimic_icu_cohort$first_careunit))

# Wraping
mimic_icu_cohort$first_careunit <- str_wrap(mimic_icu_cohort$first_careunit, 
                                            width = 10)

# plotting
plot <- ggplot(mimic_icu_cohort, aes(x = first_careunit, y = los)) +
  geom_boxplot() +
  labs(x = "First Care Unit",
       y = "Length of ICU Stay (los)",
       title = "Length of ICU Stay vs First Care Unit") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5)) 

print(plot)

Description: The plot shows the length of ICU stay vs the first ICU unit. The x-axis represents the first care unit, and the y-axis represents the length of stay in ICU. The plot shows that there is no clear relationship between the first care unit and the length of ICU stay. The length of ICU stay varies across different care units, but there is no consistent pattern.